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We analyze publicly available data on Affymetrix microarrays spike-in experiments on the human 
HGU133 chipset in which sequences are added in solution at known concentrations. The spike-in 
set contains sequences of bacterial, human and artificial origin. Our analysis is based on a recently 
introduced molecular-based model [E. Carlon and T. Heim, Physica A 362, 433 (2006)] which takes 
into account both probe-target hybridization and target-target partial hybridization in solution. 
The hybridization free energies are obtained from the nearest-neighbor model with experimentally 
determined parameters. The molecular-based model suggests a rescaling that should result in a 
"collapse" of the data at different concentrations into a single universal curve. We indeed find such 
a collapse, with the same parameters as obtained before for the older HGU95 chip set. The quality of 
the collapse varies according to the probe set considered. Artificial sequences, chosen by Affymetrix 
to be as different as possible from any other human genome sequence, generally show a much better 
collapse and thus a better agreement with the model than all other sequences. This suggests that 
the observed deviations from the predicted collapse are related to the choice of probes or have a 
biological origin, rather than being a problem with the proposed model. 
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I. INTRODUCTION 

DNA microarrays H allow to measure the gene ex- 
pression level of thousands of genes simultaneously. This 
is a major step forward compared to traditional meth- 
ods in molecular biology (as Northern blots) which are 
applicable only to a limited set of genes at a time. The 
determination of gene expression levels is not the only ap- 
plication of DNA microarrays, which have been used also 
for the analysis of genetic variance between individuals 
(single nucleotide polymorphisms), as efficient tools for 
DNA sequencing, for the study of chromosomal defects 
and for the determination of alternative splicing events. 

Despite the increasing popularity that microarrays 
have known in the recent years there are still some prob- 
lems with the technology. There has been, for instance, 
only a moderate effort in comparing different microarrays 
platforms on the same biological system 2j. When this 
comparison was made, as in a recent study on expression 
analysis of stressed-out pancreas cells, it was found that 
different commercial platforms produced wildly incom- 
patible data . These problems call for a better funda- 
mental understanding of the functioning of the microar- 
rays. Such understanding will help researchers to design 
better algorithms for microarray data analysis based on 
the physical-chemistry of the underlying hybridization 
process. 

In the past years several experiments were addressing 
the analysis of equilibrium and dynamical properties of 



DNA hybridization to probes anchored on solid surfaces 
with different techniques as, for instance, surface plasmon 
resonance Q and by quartz microbalance B. At the 
same time several papers 0, S B S ^> ll j^have been 
dedicated to theoretical aspects of hybridization, mostly 
discussing the Langmuir model and variances thereof. 

In a previous paper |2 we have analyzed a series 
of publicly available data of experiments performed on 
Affymetrix microarrays, using a simple model of the hy- 
bridization process. In these experiments a set of se- 
lected genes are "spiked-in" at fixed concentrations into 
a solution containing other types of RNAs. This set of 
data has been widely used as testground for algorithms 
designed to extract gene expression levels from the raw 
data. Affymetrix is one of the major commercial pro- 
ducers of microarrays. In Affymetrix arrays the surface- 
bound probes are prepared in situ by photolitographic 
techniques. Although the technique is limited to rather 
short oligos (25 nucleotides long) one of the advantages 
is that a high density of probe sequences per array can 
be obtained. In the latest generation 1,400,000 differ- 
ent probes have been placed in a single array. The large 
number of probes compensate for their limited length. 
Indeed Affymetrix uses multiple probes per gene, which 
define a probe set. Another special feature of Affymetrix 
chips is that it uses as control a mismatch (MM) probe 
sequence, which differs from a perfect-matching (PM) se- 
quence only at the base at position 13: a nucleotide A is 
interchanged with T and a nucleotide C is interchanged 
with G. 
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FIG. 1: (a) The simple model of hybridization in Affymetrix 
microarrays used throughout this paper is defined by two ba- 
sic reactions: 1) Hybridization between target molecules (t) 
to surface anchored probes (p) leading to a duplex pt and 2) 
The hybridization between target molecules in solution lead- 
ing to the partial duplexes tti,j. In the model, the effect of 
the hybridization in solution amounts to a reduction of the 
original target concentration c to a value ac. (b) Partial hy- 
bridization of a fragment in solution complementary to the 
target RNA sequence from base i to base j {1 < i < j < 25). 



In our previous work we focused on the spike-in 
data set of the HGU95 human chipset. More recently this 
has been substituted by the HGU133 chipset. Probe sets 
have been completely redesigned in the HGU133 chipset; 
moreover there are only 11 probes per probe set com- 
pared to the 16 probes of the HGU95 array. In this pa- 
per we focus on the analysis of publicly available spike-in 
data on the HGU133 chip, building on our previous work 
[T^ on IIGU95. This will allow us to test the robustness 
of the model introduced in Ref. to a new set of data. 
There is another interesting feature of the spike-in data 
of the HGU133 chipset: differently from the HGU95 data 
where spikes correspond to human genes, the spikes in the 
HGU133 have been selected between human, bacterial 
and "artificial" sequences. The latter were selected by 
Affymetrix to avoid cross-hybridization with any known 
human coding sequence. 



II. A SIMPLE MODEL FOR HYBRIDIZATION 
IN AFFYMETRIX ARRAYS 

In this section we briefly recall the model introduced in 
Ref. (l2|- Two basic processes are considered: 1) Target- 
probe hybridization and 2) Target-target hybridization in 
solution. According to the model the fluorescence signal 
measured from a given probe is: 



1 + acel^^(^ 



(1) 



where /q indicates a background level due to non-specific 
hybridization, A sets the scale of intensities, c is the 
target concentration (a measure of the gene expression 
level), AG the target /probe hybridization free energy, 
(j = 1/RT the inverse temperature, R the universal gas 



constant. Here, a models the reduction in the concen- 
tration of available targets due to the target-target hy- 
bridization in solution: only a fraction ac is available for 
the hybridization with probes as the remaining (1 — a)c 
form stable duplexes with other partners in solution (see 
Fig.Uta)). 

In the model introduced in Ref. [l^l, we approximate 
the target-target hybridization with the expression 
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l-hcexp(/3'AGg^^) 



(2) 



with (3' and c fitted parameters and AG^^"* = 
AGij(l,25) the (sequence dependent) RNA/RNA free 
energy for duplex formation in solution at 37 degrees 
calculated over the whole 25-mer length; in close approx- 
imation, the binding free energies at 37 and 45 degrees 
(the actual experimental temperature) are almost identi- 
cal, apart from a small scaling factor, which is adsorbed 
into the rescaled temperature /?'. In the next section, we 
will discuss the steps leading to Eq. (O in more detail. 

The model defined in Eqs. ^ and Q contains the 
four fitting parameters A, /3, j3' and c which were fitted 
against the spike- in data of the Affymetrix array HGU95a 
in Ref. The parameters /3', c and A will be dis- 

cussed in Sec. IIIII and Sec. IIVI The parameter /3 is 
the inverse temperature. Instead of fixing it to the ex- 
perimental value we have kept it as a fitting parameter 
as explained in Ref. |12| |. The hybridization free ener- 
gies AG and AG^, are calculated from tabulated experi- 
mental data for DNA/RNA 13, 14] and RNA/RNA [H 
duplex formation in solution. 

We note that we fit mismatches and perfect matches 
with the same model. The difference between the two 
is that there is a different hybridization free energy AG: 
one expects a lower signal for mismatches compared to 
perfect matches, due to weaker binding. This is not 
always the remarked in several studies for a 

substantial fraction of probes (30%, as reported in Ref. 
0) one observes "bright mismatches" for which the mis- 
match intensity /mm exceeds the intensity /pM of the 
perfect match. However, it has been observed 11] that 
bright MM come predominantly from probes with low 
intensity, which suggests that bright mismatches are as- 
sociated with weak specific hybridization when the signal 
/ is dominated b y /n in Eq. (^). 

In recent work |l6| we also compared the current model 
with the approach based on position-dependent effective 
affinities as for instance described in Refs. H,^3. The 
conclusion is that the two approaches are fully consis- 
tent with each other, provided that various effects are in- 
corporated such as partial unzipping of the probe-target 
complex, less than 100% efficiency in the probe growth 
during lithography, and entropic repulsion between the 
target and the substrate. These additional effects are 
the main factors causing position-dependence (and thus 
allowing for a comparison with position-dependent effec- 
tive affinities) ; for a quantitative prediction of the inten- 
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sities, their combined effect can be well approximated by 
a slight decrease of /3 in Eq. ^ and they are therefore 
not included in the current study. 



III. ON THE HYBRIDIZATION IN SOLUTION 

We now discuss the approximations leading to the form 
of a. We denote the concentration of free 25-mer targets 
in solution as [t] , the concentration of free target strands 
that are complementary from nucleotide i up and includ- 
ing nucleotide j as [iij], and the concentration of du- 
plexes between these two as [tti_j]. Chemical equilibrium 
(see Fig. E^b)) yields for the equilibrium constant: 



[tti.j 



(3) 



where AGuihj) is the RNA/RNA hybridization free en- 
ergy for target molecules in solution, which arc com- 
plementary from nucleotide i up and including j, and 
/3 = 1.59 mol/kcal (corresponding to the experimental 
temperature of 45 degrees). For a given gene, the mea- 
sure of the gene expression level which one wants to de- 
termine is the total target concentration c given by 



(4) 



Solving Eqs.Q) and Q we find for the fraction of single 
stranded target in solution: 



a/ = 



W ^ I 

c l + E.,,[iM]exp(/3AGfl(z,j))' 



(5) 



Note that the summation in the denominator of Eq. (O 
was replaced in the approximate expression Eq. by 
the single term cexp(/3'AG^^''), with fitting parameters 
c and f3'. 

Eq. jSJ requires as input estimates of the concentration 
[tij] of complementary sequences with length I = j — 
present in solution. Assuming that all four nucleotides 
are roughly equally abundant, and that there are no cor- 
relations along the sequence, the abundance of short se- 
quences with length I will decrease as [tij] ~ 4~'. This 
scaling breaks down beyond some length L; assuming 
for the human transscriptome a total length of 10^ nu- 
cleotides, a random sequence longer than 12 is more likely 
not present at all, since 4-"^^ > 10^. We therefore take as 
our approximation 



Co •4-(^-*) for j - i < 12, 
otherwise. 



(6) 



Here, cq is a measure of the RNA concentration. Using 
this approximation for the concentration of complemen- 
tary strands, we can now compare Eqs. (|2Jl and |(SJ. Fig. 
121 shows the more elaborate model Eq. (|3J) as a function 
of the approximate form Eq. (O, with the values for the 



30 



20 



^ 10 




20 



30 

log(a"-l) 



40 



FIG. 2: Comparison of the summation in Eq. equal to 
aj^ — 1, and its approximation in Eq. I^J, equal to — 1, 
for the first 1,000 spike-in sequences of HGU133. Note that 
a change in co corresponds to a vertical shift over log(co); in 
this figure, we used cq = 1. The straight fine is a fit, given by 
y = x + b with b = —14.1. 



fitting parameters (3' and c taken from Ref. There 
is a reasonable agreement between the two. 

Since Eq. jSJ has a better microscopic foundation than 
Eq. (121, it should in principle allow for a better estimate 
of the hybridization in solution. There are however se- 
vere limitations to the use of Eq. Q . In the hybridiza- 
tion in solution, there is a competition between the con- 
tributions of short sequences, which are abundant but 
have a low affinity, versus long sequences, for which the 
concentration is low but the affinity high. The concen- 
tration drops on average approximately by a factor of 
4 per added length (see Eq. ©), but the affinity grows 
by approximately (AG) « 2 or 3 kcal/mol, the average 
value of RNA/RNA interaction parameters 0|. Since 
exp(/3(AG)) > 4, the longer sequences dominate the hy- 
bridization in solution. However, as discussed above, be- 
yond length L 12, there simply are no complemen- 
tary strands. The accuracy of the more elaborate model 
Eq. Q thus hinges crucially on knowing the longest com- 
plementary strand which is transcribed, as well as its 
affinity and its concentration. Since the approximate 
model Eq. (refalpha) is not expected to perform worse 
than the more elaborate model Eq. (O, we keep using 
the former. 

The data points in Fig.|21can be fitted by a straight line 
with slope 1: the value of = 0.67 mol/kcal in Ref. [l^ . 
corresponding to 725 K, apparently is the appropriate 
value to describe the experiments at a temperature of 
45 degrees. The offset in the straight-line fit is equal to 
log(c)— log(co). Since the straight-line fit has an offset of - 
14.1, and since we used the fitted value of c = 2 • 10~^pM 
in Ref. an estimate of the RNA concentration is 

Co = exp(14.1)-c = 30 nM. Even if we do not use the more 
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FIG. 3: Plot on intensity vs. concentration for three spike-in 
genes of the HGU133 chipset, /max indicates the saturation 
value obtained from a three parameters (7o, A and K) non- 
linear fit based on Eq. 0. 



elaborate model Eq. (O , it provides us with a microscopic 
basis for the values of the parameters P' and c in the 
approximate model Eq. 



IV. ON THE SIGNAL SATURATION LEVEL 

If the target concentration c and the binding energy 
AG are sufficiently high, the Langmuir isotherm satu- 
rates to a maximal value. From Eq. we find for 
cexp(/3AG) > 1 



/max =Io+ARiA, 



(7) 



where we have used the fact that typically the back- 
ground level, Jo, is much lower than the value of A. The 
saturation intensity arises if targets are bound to almost 
all probes. Since the number of probes does not vary be- 
tween the sequences being measured, this saturation in- 
tensity is also expected to be sequence-independent, and 
more specifically, should not distinguish between perfect 
matches and mismatches. A recent analysis of the Latin 
square set f^, [isj reported widely different values for the 
saturation intensity. It is worth clarifying further this 
issue here. 

The obvious procedure to determine the saturation in- 
tensity, is to look at the intensity of a probe as a func- 
tion of concentration. Assuming an effective affinity Ks 
for probe sequence s, the intensity /s(c) as a function of 
concentration c is given by 



/s(c) = /o,s + 



A,cK, 
1 + cKs 



(8) 



in which /o,s is the (sequence-dependent) background in- 
tensity due to non-specific binding. A plot of Ig vs. c for 
two probes of the IIGU133 spike-in set is shown in Fig. 



10 

o 

10^ 

o 

10^ 



7-^-5 : 



11<K^ 1 4 



204513_s_at 



6 ^ ^4 -^^ 



11 



. 10 



' 10" 



204414_at 
11 



24 28 32 36 24 28 32 36 



— : 10 



. 10 



207540_at 



AFFX-r2-TagG_at: 



10" 



24 28 32 36 24 28 32 36 

AG* (kcal/mol) AG* (kcal/mol) 

FIG. 4: Plot of J - Jo as a function of AG - RT log a for 4 
sequences spiked-in at a concentration of c = 512 pM. The 
numbers indicate the probe set numbers. Smaller characters 
are used for the MM signals. Solid lines represent the Lang- 
muir model as given by Eq. ((2J. The data are consistent, 
except few outliers, with the Langmuir model with roughly 
constant saturation level A f» 10*. 



13 Taking Iq, A and K in eq. © as fitting parameters, 
and extrapolating to high concentration then yields the 
saturation intensity. 

Two research groups 0, 0| followed this procedure, 
and both found saturation intensities that vary wildly 
between different sequences. A first effect that can cause 
deviations from the Langmuir fit Eq. © is that the litho- 
graphic process, through which the probes are synthe- 
sized in situ in Affymetrix chips, is not 100% efficient. As 
estimated by Burden only about 10% of the probes 
reach the full length of 25 nucleotides. At low intensi- 
ties far from saturation, the incomplete probes can be 
safely ignored since their affinity is much lower than that 
of the fully grown probes. However, under conditions 
where the fully grown probes are saturated, clearly there 
will be contributions to the fluorescent intensity from the 
almost complete probes, and an even further increase in 
concentration will bring into play shorter and shorter in- 
complete probes. Consequently, the Langmuir fit Eq. ^ 
breaks down near saturation; extrapolation to high con- 
centration is an unreliable procedure. 

A second cause of worry is that comparing fluores- 
cent intensities from different chips is also potentially 
unreliable, since the microarrays might have undergone 
slightly different processing during the washing and stain- 
ing. Since Affymetrix microarrays cannot be reused, the 
spike-in measurements used in Refs. 0, E| required a 
new chip for each concentration. 

To avoid these two potential sources of error, we there- 
fore consider the intensities for a given probe set at a 
specific concentration, i.e. c constant and AG and a 
variables in Eq. Q ■ The data belong to the same array. 
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FIG. 5: Histograms of the PM and MM intensities for the Latin square experiments in log-log scale for the chips HGU95a (a) 
and HGU133 (b). The plots contain 19 histograms referring to different experiments (a) and 12 experiments (b). The dashed 
lines are positioned at 7 = 10000 and / = 15000 (intensities are given in Affymetrix scale). Insets: histograms of the total 
intensity of PM and MM together. 



An example of this type of analysis is shown in Fig. 2|for 
a concentration of c = 512 pM. On the horizontal axis 
we plot AG* = AG — RTloga. The solid lines are given 
by the Langmuir curve Eq. Note that the large 

majority of the probes align along the expected curve, 
with few exceptions as for instance probe 11 (both PM 
and MM) for the probe set 204414_at. Therefore, the 
data are consistent with a value of A roughly constant in 
Eq. which suggests indeed that the large variations 
in /max obtained from the extrapolations of the data in 
the earlier analysis are more likely to be an artifact of 
the extrapolations. Note however that some variability 
of the saturation level can be seen in the data of Fig. 
01 Typically this variability is of about 20%. In order 
to keep the model simple we will keep A constant in the 
rest of the paper. An interesting possible explanation of 
the variability of A has been given in Ref. 18], i.e. that 
this variation is due to the post-hybridization washing of 
the array. 

Yet another different way of addressing the issue of 
the saturation intensities is to analyze the histogram of 
the intensities on the whole chip, as in Fig. [31 which 
shows both the intensities for the HGU95 and HGU133 
spike- in data. To reveal the data at high intensities, they 
are plotted in a log-log scale. In the figure we note a 
drop in the histogram around / ?» 10 000, sharper in 
the HGU133 chipset, which is consistent with the esti- 
mate of the saturation intensity obtained from the fits 
of intensities vs AG — i^Tloga, as given in Fig. 0] 
Note that in Fig. [SJb) the drop is 100-fold in the range 
10 000 < / < 15 000, which suggests that the data are 
consistent with a roughly constant value of the satura- 
tion. However a more close inspection of the histogram of 
the HGU133 for PM and MM intensities separately, re- 
veals that the estimated saturation value for the two may 
be different. In the case of PM intensities alone the drop 



is rather sharp at around / « 10 000, however the MM 
intensities seem to saturate at lower intensities, which is 
not seen in the IIGU95 data (Fig. Ufa)). The number of 
MM probes reaching an intensity close to the saturation 
level in the histogram of Fig. El^b) is quite small so the 
fact that the the MM and PM reach a different saturation 
level cannot be concluded for sure. 

Also the low-intensity side of the histograms in Fig. [51 
contain interesting information. Both for the IIGU95 and 
HGU133, the intensity drops steeply below a minimal 
intensity. For HGU95, this drop occurs around /min ~ 70, 
while for HGU133 the drop occurs around Jmin ~ 30. 
This increase of the dynamical intensity range by more 
than a factor of two is a clear demonstration of the fast 
rate of improvement in microarray technology. 

V. ANALYSIS OF DATA COLLAPSES 

As a test of the validity of the model we plotted 
the data as a function of the rescaled variable: 

x' = aceP^°. (9) 

If the model is to be trusted the data for different values 
of c and different probe sequences (i.e. different AG and 
a) ought to "collapse" onto a single master curve 

At' 

I-Io = ^. (10) 
1 + x' 

This collapse has indeed been observed in the large ma- 
jority of the spike- in genes of the HGU95a chipset [T^ . 
Interestingly, the very few outliers observed in that case 
could be explained as annotation errors or unbalance of 
free energies used for specific nucleotides, as discussed in 
Ref. 113. 




FIG. 6: Collapse plots for the 4 bacterial and the 8 artificial sequences of the HGU133 spike-in set. In these plots the 
background subtracted intensities for a given probe set are plotted as functions of the rescaled variable x' given in Eq. Q. 
The data corresponds to all spike-in concentrations for a given probe sets. Solid lines correspond to the Langmuir isotherm. 
Compared with the human and bacterial sequences the artificial sequences are characterized by the best collapses. 



We choose here the same fitting parameters used in 
Ref. 13 for the HGU95 chipset, that is: ^ = 10 000, 
/? = 0.74 mol/kcal, /3' = 0.67 mol/kcal and c = lO^^ pM. 
These parameters fit equally well the HGU133 spike-in 
data. 

In Figs. El and IHl we show the collapse plots for all 
the 42 genes of the spike-in data set HGU133. Each plot 
contains about 200 points, which all tend to cluster (in 
some cases much better than others) along the Langmuir 
curve Ax'/ (1-1- a;'). All the 13 concentrations, which range 



TABLE I: List of values of (w) and for the bacterial and 
the artificial sequences in the spike-in set HGU133. 



Probe set 






Probe set 






AFFX-DapX-3^t 


0.08 


1.49 


AFFX-PheX-3.at 


0.16 


1.55 


AFFX-LysX-3^t 


0.89 


2.46 


AFFX-ThrX-3jj,t 


0.22 


1.59 


AFFX-r2-TagA_at 


-1.05 


0.97 


AFFX-r2-TagE_at 


-0.32 


0.82 


AFFX-r2-TagB_at 


-0.51 


0.83 


AFFX-r2-TagF_at 


-0.46 


1.09 


AFFX-r2-TagC_at 


0.43 


1.08 


AFFX-r2-TagG_at 


-0.11 


0.90 


AFFX-r2-TagD_at 


-0.03 


0.90 


AFFX-r2-TagH_at 


0.11 


1.22 



from 0.125 pM to 512 pM in the spike-in experiment, 
are shown. The intensities measured at c = are taken 
as estimates of the background level /q in Eg. ljlOl) . In 
the collapse plots only the MM sequences for which a 
AG could be estimated are shown, as the mismatch free 
energies in RNA/DNA duplexes are known only for a 
limited set of mismatches [lj| (we could associate a free 
energy to about 30% of mismatches, as discussed in Ref. 

The IIGU133 spike-in set contains 4 bacterial se- 
quences and 8 artificial sequences (Fig. O and 30 hu- 
man sequences (Fig. [71and|Sl). A perfect agreement with 
the Langmuir theory would imply that the data all align 
along the curve given by Eq. IjlOfl and shown as a solid 
line in the Figs. El[Z|and|Sl In general the agreement is 
best for the artificial sequences. Occasionally, also some 
human sequences collapse well into a single curve in good 
agreement with the Langmuir model, but in general their 
behavior is worse than artificial ones. In order to measure 
the data dispersion we introduce the variable: 

».,„g(£), ,11) 

where / is the measured intensity and /th the theoretical 
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FIG. 7: Collapse plots for Human sequences of the HGU133 spike-in set (part 1). The probes which are complementary to 
targets which the largest folding free energies are emphasized (see Table ITTTl . They correspond to probes 204912_atl0 and 
204513_s_at4. 



value as predicted from the Langmuir isotherm (Eq. (|10|) ') 
for the x' corresponding to the measured /. For the defi- 
nition of w in Eq. 1)11(1 we have kept only the values of / 
in the range 100 < I < 10000. We determine its average 
(w) and standard deviation a^- If the data are well- 
centered around the expected behavior one has (w) = 0, 
while (Tiu is a measure of the spread in the data. 

The values of (w) and for the bacterial, artificial 
and human sequences are given in the tables^andm re- 
spectively. We note that ti^ is on average the lowest for 
the artificial sequences with typical value cr^ « 1. Only 
for two human probe sets (205790_at and 207540_s_at 
with w 0.7) the collapse is better than that of the arti- 
ficial sequences. For three human probe sets (204205_at, 
207641_at and 212827_at) the collapse is very poor as in- 
dicated by a (Ttu > 2. The collapses in the four bacterial 



sequences have somewhat higher dispersion compared to 
human sequences. 

A very interesting feature of the whole analysis is that 
the quality of collapses is much better for artificial se- 
quences than for any other sequence. Artificial sequences 
have been chosen by Affymetrix to be as different as pos- 
sible from any human RNA so to minimize the effects of 
cross-hybridization. Their preparation, as labeling and 
target fragmentation are concerned, is the same as for 
all other spikes |0. As in all collapses the same set of 
parameters is used, the high (t„ for some probe sets is 
very likely an indication that the selected probes are not 
yet optimal. Possible deviations from the theory are due 
to cross-hybridization. 
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FIG. 8: Collapse plots for Human sequences of the HGU133 spike-in set (part 2). The probes which axe complementary to 
targets which the largest folding free energies are emphasized (see Table ITTTll . They correspond to probes 207641_at5 and 
209354_at8. 



VI. DETERMINATION OF THE EXPRESSION 
LEVEL 



The model defined by Eqs. 1^ and ((SJ, once all param- 
eters have been fixed, can be used to fit the concentration 
c starting from the measured intensities. The target con- 
centration in solution is a measurement of the gene ex- 
pression level and it is the quantity one wants to compute 
from the raw microarray data. As the concentrations in 
the spike-in experiments are known, we can compare the 
known values with the fitted ones. Figure 1^1 shows a plot 
of fitted concentration vs. spike-in concentration for the 
artificial sequences. We limit ourselves here to show the 
data for these sequences, but the trend is quite general 
and valid for other genes as well. The solid line in Fig. 



corresponds to a line y = x, which means perfect agree- 
ment between spike-in and fitted values. The two other 
lines correspond to y = 2a; and y = x/2, drawn as a guide 
to the eye. 

As shown in Fig. |51 most of the data fall in the range 
between the two lines, except for the spikes TagA and 
TagF which give a much lower fitted concentration. All 
the points follow approximately straight lines with slope 
1, except for the highest spike-in concentrations, corre- 
sponding to 256 and 512 pM. This is due to the fact 
that at high concentrations many probes are very close 
to saturation. 

We note also that the fitted concentrations are all sys- 
tematically lower than the spike-in values, as most of the 
concentrations fall in the interval [cspiko-in/2, Cspike-in]- 
This is a consequence of our choice to use the fitting 
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TABLE II: List of values of (w) and for the human se- 
quences in the spike-in set HGU133. 



Probe set [w) 




Probe set 


(W) 




2UU0D0_S_at U.o4 


1 o^; 


OAKK^iA 

z0ooD9_at 


A OO 

-O.zo 


1 1 O 

i.iz 


2Uo4/i_S_at U.o9 


1 AO 


OAKCIAO ,^ .^4- 

z0aD9z_s_at 


A O /I 

U.z4 


1 OT 

1.2 ( 


2UooUo_at U.4o 


1 OQ 
i.OO 


OAC7AA 

zOo r 90_at 


A '70 

-0. /o 


A 

U. 10 


^U4zUo_at U.oD 


Oil 


OAflAfIA 

zUDUDU_S_at 


A C^O 

U.Oz 


i.DD 


2U44i/_at -U.24 


1 1 o 

i.io 


OATI ^iA 

zO ( iDO_at 


A OO 

-O.oz 


I.OD 


OA/1/10A .^J- A AO 

zu44ou_s_at -U.4o 


1.13 


OATC/IA ,-. 

207540_s_at 


-0.29 


0.62 


zU4oio_S_at -U.Do 


1 If; 
l.iD 


zU / D4i_at 


A O /I 

U.z4 


o 70 
Z. I z 


zU40Do_at -U.O/ 


1 A A 

1.44 


zO ( Dt30_s_at 


U. ( D 


1 At: 

l.Uo 


204836_at -0.04 


1.41 


207777_s_at 


-0.14 


1.11 


204912_at -0.31 


1.35 


207968_s_at 


-0.85 


1.66 


204951_at -0.15 


1.48 


209354_at 


0.04 


1.41 


204959_at 1.33 


1.62 


209606_at 


0.77 


1.44 


205267_at 0.36 


1.23 


209734_at 


-0.20 


1.51 


205291_at -0.44 


1.24 


209795_at 


0.63 


1.71 


205398_s_at -0.15 


1.37 


212827_at 


0.61 


2.53 


TABLE III: Minimal folding free energies for the targets (as- 
sumed to be 25-mers) complementary to the probes forming 
the spike-in HGU133 data set. These free energies are calcu- 
lated with the program RNAfold. 


Probe set 


Probe number -AG 


old (kcal/ 


mol) 


204513_s_at 




4 


8.70 




207641_at 




5 


8.16 




204430_s_at 




10 


7.79 




209354_at 




8 


7.67 




207540_s_at 




10 


7.45 




AFFX-r2-TagA_at 




1 


6.52 




205398_s_at 




1 


6.43 




AFFX-PheX-3^t 




10 


6.18 




204836_at 




10 


6.17 




203508_at 




2 


6.10 




206060_s_at 




3 


6.05 





parameters which wc took from a previous study T2| of 
spike- in experiments on the HGU95. We have chosen not 
to refit these parameters here again for HGU133, to illus- 
trate their universal validity. The slight underestimation 
of the absolute concentration is not a problem, since in 
gene expression measurements one is only interested in 
fold-variations of expression levels between different ex- 
perimental conditions. The fact that the data of Fig. |U| 
follow lines with a slope of approximately one guarantees 
that the fold-change in concentration in different experi- 
ments is correctly estimated. 
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FIG. 9: Plot of the fitted target concentration as a function 
of the spike-in concentration for the artificial sequences. The 
solid line correspond to the diagonal y = x, while the two 
dotted lines are y = x/2 and y = 2x and are drawn as guides 
to the eye. We note a systematic shift of the estimated abso- 
lute concentration compared to the spike-in one, although the 
fold-variations of the concentrations are correctly estimated 
as the majority of the data follow lines parallel to the diagonal 
in the plot. 



VII. ONE CAUSE OF OUTLIERS: TARGET 
SECONDARY STRUCTURES 

It is well-known that single stranded nucleic acids, par- 
ticularly RNA, tend to form stable folded conformations 
by binding of complementary bases. Currently, algo- 
rithms that calculate RNA secondary structures are to 
be trusted for sufficiently short molecules, say less than 
50 nucleotides, which is the situation of Affymetrix mi- 
croarrays, where RNA targets are fragmented before hy- 
bridization. The average target length is 50, but proba- 
bly only shorter fragment contribute to hybridization. 

We used the Vienna package '2o!| for the calculation 
of folded RNA structures that may form in solution and 
impede hybridization. We considered first 25-mer tar- 
gets in solution exactly complementary to the probes of 
the HGU133 spike-in data set. Table IITTI shows a list of 
probes in this set, whose complementary target has the 
lowest folding free energy, i.e. that of the most stable 
conformation, calculated at the experimental tempera- 
ture of 45° C. Given a folding free energy AGfoid, one 
can use the two state model approximation to find Pfoid 
the probability that the sequence is folded into the most 
stable conformation: 



Pfold 



-AGf„id/-RT 
p-AGtoid/-RT 



(12) 



where we use T — 45° C. According to this expres- 
sion for a folding free energy AGfoid = —8 kcal/mol one 
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FIG. 10; Folding configurations for the four targets with 
the lowest free energy. From left to right: 204513_s_at4, 
207641_at5, 204430_s_atl0 and 209354_at8. 



finds 1 — pfoid ~ 4 • 10~^ and AGfoid = —6 kcal/mol 
1 — Pfoid ~ 10^"*. Therefore the large majority of the tar- 
gets complementary to the probes listed in Table ITTll are 
folded and not expected to participate to hybridization. 

Figure El shows the folding configurations for the four 
targets with the lowest free energy of Table lllll As shown 
in Figs. E|and|Hlthe corresponding probes have a signal 
which is few order of magnitude lower than that expected 
from the Langmuir model, although not as low as derived 
from Eq. lO, using the AGfoid Hsted in Table |llll For 
instance, from the measured signals we find an intensity 
lower by a factor 10^ for the probe 204513_s_at4, instead 
of a factor 10^ as deduced from Eq. (|12|) . This difference 
could have several origins. First, the hybridization in so- 
lution described by the term a in Eq. |(2Jl may already 
take into account some secondary structure formation. 
Second, the RNA in solution is present with sequences 
of all lengths. The free energies listed in Table ITTll refer 
to 25-mers, so shorter sequences will have lower folding 
probability than that deduced from Eq. I|12() on the ba- 
sis of the free energies of 25-mers. Third, even if some 
secondary structure is present, hybridization with the 
surface-bound probes is still possible if the folded con- 
figuration has some dangling ends from which binding 
can initiate. 

We have analyzed the folding free energies of 25-mcrs 
complementary to all the probes in the HGU spike- in set. 
We found that about 50% of the targets have folding free 
energy lower than 1 kcal/mol, so that secondary structure 
formation can be safely neglected. About 10% of the 
targets have a folding free energy higher than 4 kcal/mol, 
so that for this fraction the secondary structure formation 
may interfere with the target-probe hybridization. 

Summarizing, the correct estimate of the folding prob- 
ability involves a complex calculation over fragments of 
all lengths, possibly including sequences neighboring the 
25-mer part complementary to the probe. However the 
folding is expected to have a relevant effect for at most 
10% of the probes. A possible way out is that of ex- 
cluding from the analysis of the gene expression levels 
those probes whose 25-mers folding free energy is above 
a certain threshold. 



VIII. CONCLUSION 

In this paper we have extended a previous study 
of Affymetrix spike- in experiments on the chip HGU95, 
to a novel HGU133 chipset. We used the model in- 
troduced in Ref. which takes into account both 
target-probe and target-target hybridization in solution. 
The hybridization free energies are calculated from the 
nearest-neighbor model IITII using the experimental pa- 
rameters for RNA/DNA [ll Hj and RNA/RNA 15|. 
There are four global fitting parameters in the model that 
we took from Ref. • We found that these parameters 
fit well also the current data on the HGU133 chipset, 
apart for a systematic small shift of all the estimates of 
the absolute target concentrations. 

There are several features that make the spike-in data 
of the more recent HGU133 chip interesting. First of all 
the spike-in set contains a larger number of sequences 
compared to the HGU95 experiments (42 instead of 14) 
and the chip has been entirely redesigned. Secondly, 
the spike-in sequences contain some of artificial origin, 
designed to avoid any cross hybridization with human 
RNAs, but prepared and labeled exactly as all other 
spikes. We find that these artificial sequences fit best 
the hybridization model, as they show the best collapses 
when the data are rescaled and plotted as function of an 
appropriate thermodynamic variable. The good agree- 
ment suggests indeed that the simple model describes 
rather well the hybridization in Affymetrix arrays and 
that the deviations observed for some human sequences 
are probably related to the non-optimal design of the se- 
quences for a given probe. 

When compared to the human sequences of the IIGU95 
spike- in experiments analyzed in Ref. '^^ thai 
the artificial spikes of the HGU133 set show definitely 
better collapses. However, when comparing the human 
sequences of the HGU133 with those in the HGU95 ex- 
periment we find on average a better collapse for the lat- 
ter. Only few probes out of the 32 human spikes of the 
HGU133 experiment have a better collapse than those of 
the HGU95. 

Interestingly, the physics-based modeling developed 
here allows to assign to each probe set a quality score 
based on the level of agreement on the Langmuir model. 
This information may be used to reconsider and eventu- 
ally redesign the probe sets of low quality. 

Finally, we have discussed the physical basis of hy- 
bridization in solution and of RNA secondary structure 
formation. The latter effect, according to the statistics 
over the spike-in probes, will be relevant for about 10% of 
the probes only. The sequences with the highest folding 
probability correspond to probes whose measured fluo- 
rescent intensities is well-below that predicted from the 
Langmuir model. 

According to our current understanding of the sys- 
tem (see also Refs. ^3), the hybridization in so- 
lution of partially complementary RNA molecules has 
a strong influence. One of the reasons for that is 
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that RNA/RNA interaction parameters are, at given 
temperature and salt concentration, stronger than the 
DNA/DNA or RNA/DNA parameters. The simple ap- 
proximation given in Eq.© captures the major features 
of the hybridization in solution. However, an improve- 
ment over this approach, as discussed above, remains an 



open challenge. 
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